Abstract

In a distributed system using message logging and checkpointing to
provide fault tolerance, there is always a unique maximum recoverable
system state, regardless of the message logging protocol used. The
proof of this relies on the observation that the set of system states
that have occurred during any single execution of a system forms a
lattice, with the sets of consistent and recoverable system states as
sublattices. The maximum recoverable system state never decreases, and
if all messages are eventually logged, the domino effect cannot occur.
This paper presents a general model for reasoning about recovery in
such a system and, based on this model, an efficient algorithm for
determining the maximum recoverable system state at any time. This
work unifies existing approaches to fault tolerance based on message
logging and checkpointing, and improves on existing methods for
optimistic recovery in distributed systems.

1 Introduction

Message logging and checkpointing can be used to provide an effective
fault-tolerance mechanism in a distributed system in which all process
communication is through messages. Each message received by a process
is logged on stable storage [5], and each process is occasionally
checkpointed to stable storage, but no coordination is required
between the checkpoints of different processes. Between received
messages, the execution of each process is assumed to be
deterministic.

The protocols used for message logging are typically pessimistic. With
these protocols, each message is synchronously logged as it is
received, either by blocking the receiver until the message is logged
[1, 6], or by blocking the receiver if it attempts to send a new
message before this received message is logged [3]. Recovery based on
pessimistic message logging is straightforward. A failed process is
restarted from its last checkpoint, and all messages originally
received by this process since the checkpoint are replayed to it from
the log in the same order as they were received before the failure.
The process reexecutes based on these messages to its state at the
time of the failure. Messages sent by the process during recovery are
ignored since they are duplicates of those sent before the failure.
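
The replay step described above can be illustrated with a small Python sketch. The `Counter` class, its `handle` method, and the in-memory `log` list are hypothetical stand-ins for a real deterministic process and for stable storage; they are assumptions made for illustration only.

```python
import copy


class Counter:
    """Toy deterministic process: its state is a running total of inputs."""

    def __init__(self):
        self.total = 0

    def handle(self, message):
        # Deterministic: the new state depends only on the old state
        # and on the contents of the message.
        self.total += message
        return self.total  # stands in for an output message


def recover(checkpoint, logged_messages):
    """Restart from a checkpoint and replay logged messages in receipt order."""
    process = copy.deepcopy(checkpoint)
    for message in logged_messages:
        process.handle(message)  # outputs are duplicates, so they are dropped
    return process


# Failure-free run: checkpoint once, then log each message before handling it.
live = Counter()
checkpoint = copy.deepcopy(live)
log = []
for message in [3, 4, 5]:
    log.append(message)
    live.handle(message)

recovered = recover(checkpoint, log)
```

Because every message is logged before it is handled, the recovered process reaches the same state as the process that failed.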

On the other hand, optimistic protocols perform the message logging
asynchronously [9]. The receiver continues to execute normally, and
received messages are logged later, for example by grouping several
messages and writing them to stable storage in a single operation. The
receiver of a message depends on the state of the sender, though. If
the sender fails and cannot be recovered (for example, because some
message has not been logged), the receiver becomes an orphan process,
and its state must be rolled back during recovery to a point before
this dependency was created. If rolling back this process causes other
processes to become orphans, they too must be rolled back during
recovery. The domino effect [7, 8] is an uncontrolled propagation of
such rollbacks and must be avoided to guarantee progress in spite of
failures. Recovery based on optimistic message logging must find the
"most recent" combination of process states that can be recreated,
such that none of the process states is an orphan.

Optimistic message logging protocols appear to be desirable in systems
in which failures are rare and failure-free performance is of primary
concern. Since optimistic protocols avoid synchronization delays
during message logging, performance in the absence of failures is
improved. Although the required recovery procedure is then more
complicated, this procedure is only required when a failure occurs.

Section 2 of this paper presents a general model for reasoning about
these recovery methods in distributed systems. With this model, we
show that there is always a unique maximum recoverable system state,
which never decreases, and that if all messages received are
eventually logged, the domino effect cannot occur. Based on this
model, Section 3 describes and proves the correctness of our algorithm
for determining the maximum recoverable system state at any time. The
algorithm requires no additional messages in the system, and supports
recovery from any number of concurrent failures, including a total
failure. Our model and algorithm make no assumptions about the message
logging protocol used; they support both pessimistic and optimistic
logging protocols, although pessimistic protocols do not require their
full generality. Section 4 then relates this work to existing
fault-tolerance methods published in the literature and discusses the
effect of different message logging protocols on our model and
algorithm. Finally, Section 5 summarizes the contributions of this
work and draws some conclusions.


2 The Model

2.1 Process States

Each time a process receives an input message, it begins a new state
interval, a deterministic sequence of execution based only on the
state of the process at the time that the message is received and on
the contents of the message itself. Within each process, each state
interval is identified by a unique sequential state interval index,
which is simply a count of the number of input messages that the
process has received.

All dependencies of a process i on some process j can be encoded
simply as the maximum index of any state interval of process j on
which process i depends. This encoding is possible since the execution
of a process within each state interval is deterministic and since any
state interval in a process naturally also depends on all previous
intervals of the same process.

All dependencies of any process i can, therefore, be represented by a
dependency vector

$$d_i = \langle \delta_* \rangle = \langle \delta_1, \delta_2, \delta_3, \ldots, \delta_n \rangle ,$$

where n is the total number of processes in the system. Component j of
process i's dependency vector, $\delta_j$, gives the maximum index of
any state interval of process j on which process i currently depends.
Component i of process i's own dependency vector is always set to the
index of process i's current state interval. If process i has no
dependency on any state interval of some process j, then $\delta_j$ is
set to $\bot$, which is less than all possible state interval indices.

Processes cooperate to maintain their current dependency vectors by
tagging all messages sent with their current state interval index and
by remembering in each process the maximum index in any message
received from each other process. During any single execution of the
system, the dependency vector for any process is uniquely determined
by the state interval index of the process. No component of the
dependency vector of any process can decrease through normal execution
of the process.

2.2 System States

The state of the system is the composition of the states of all
component processes of the system and may be represented by an
$n \times n$ dependency matrix. Taking the dependency vector, $d_i$,
of each process i in the system, the dependency matrix

$$D = [\delta_{*\,*}] =
\begin{bmatrix}
\delta_{1\,1} & \delta_{1\,2} & \cdots & \delta_{1\,n} \\
\delta_{2\,1} & \delta_{2\,2} & \cdots & \delta_{2\,n} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{n\,1} & \delta_{n\,2} & \cdots & \delta_{n\,n}
\end{bmatrix}$$

is formed, where row i is the dependency vector for process i. Since
component i of process i's dependency vector is always the index of
process i's current state interval, the diagonal of the dependency
matrix, $\delta_{i\,i}$, $1 \le i \le n$, shows the current state
interval index of each process in the system.
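
As an illustration of how dependency vectors and the dependency matrix might be maintained, consider the following Python sketch. It rests on assumptions that are not part of the model itself: processes are numbered from 0, state interval indices start at 0, and $\bot$ is represented by the hypothetical constant `BOTTOM = -1`. The class and function names are likewise illustrative.

```python
BOTTOM = -1  # assumed stand-in for "bottom": below every state interval index


class Process:
    def __init__(self, pid, n):
        self.pid = pid
        self.vector = [BOTTOM] * n  # dependency vector of this process
        self.vector[pid] = 0        # own current state interval index

    def send(self):
        """Tag an outgoing message with the sender and its current interval."""
        return (self.pid, self.vector[self.pid])

    def receive(self, tag):
        """Receipt starts a new state interval and records the dependency."""
        sender, index = tag
        self.vector[self.pid] += 1
        self.vector[sender] = max(self.vector[sender], index)


def dependency_matrix(processes):
    """Row i of the matrix is the dependency vector of process i."""
    return [list(p.vector) for p in processes]


def current_intervals(matrix):
    """The diagonal gives each process's current state interval index."""
    return [matrix[i][i] for i in range(len(matrix))]


p0, p1, p2 = (Process(i, 3) for i in range(3))
p1.receive(p0.send())  # p1 starts interval 1, depending on interval 0 of p0
p2.receive(p1.send())  # p2 starts interval 1, depending on interval 1 of p1
p1.receive(p2.send())  # p1 starts interval 2, depending on interval 1 of p2
matrix = dependency_matrix([p0, p1, p2])
```

Only direct dependencies are recorded here, since each message carries just the sender's current state interval index.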

Let $\mathcal{S}$ be the set of all system states that have occurred
during any single execution of some system. The set $\mathcal{S}$
forms a partial order called the system history relation, in which one
system state precedes another if and only if it must have occurred
first during the execution. This relation can be expressed in terms of
the state interval index of each process as shown in the dependency
matrix.

Definition 1 If $A = [\alpha_{*\,*}]$ and $B = [\beta_{*\,*}]$ are
system states in $\mathcal{S}$, then

$$A \preceq B \iff \forall\, i \; [\alpha_{i\,i} \le \beta_{i\,i}] .$$
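
Definition 1 compares only the diagonals of the two dependency matrices. A minimal Python sketch, assuming matrices are lists of rows and $\bot$ is represented by -1 (both assumptions for illustration, as are the function names and example values):

```python
def precedes(a, b):
    """A precedes-or-equals B iff each diagonal entry of A is <= that of B."""
    return all(a[i][i] <= b[i][i] for i in range(len(a)))


def incomparable(a, b):
    """Neither state is ordered before the other."""
    return not precedes(a, b) and not precedes(b, a)


# Two hypothetical two-process states: in A only process 0 has advanced,
# in B only process 1 has advanced.
initial = [[0, -1], [-1, 0]]
state_a = [[1, -1], [-1, 0]]
state_b = [[0, -1], [-1, 1]]
```

Here `initial` precedes both states, while `state_a` and `state_b` are incomparable because each is ahead of the other on one diagonal entry.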

This partial order differs from that defined by Lamport's happened
before relation [4] in that it orders the system states that result
from events rather than the events themselves, and that only state
intervals (started by the receipt of a message) constitute events.

For example, Figure 1 shows a system of four communicating processes.
The horizontal lines represent the execution of each process, each
arrow represents a message from one process to another, and the
numbers give the index of the state interval started by the receipt of
each message. Consider the two possible system states A and B, where
in state A, message a has been received but message b has not, and in
state B, message b has been received but message a has not. These
states can be expressed by the dependency matrices

States A and B are incomparable under the system history relation,
which can be seen by comparing the circled values on the diagonals of
these two dependency matrices.


2.3  The  System  History  Lattice 

A system state describes the set of messages that have been received
by each process. Two system states in $\mathcal{S}$ can be combined to
form their union such that each process has received all of the
messages that it has in either of the two original system states. This
can be expressed in terms of the dependency matrices describing these
system states by choosing for each process the row that has the
largest state interval index of the corresponding rows in the original
matrices.

Figure 1 The system history partial order

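Definition 2 If $A = [\alpha_{*\,*}]$ and $B = [\beta_{*\,*}]$ are
system states in $\mathcal{S}$, then the union of A and B is
$A \cup B = [\gamma_{*\,*}]$, such that

$$\forall\, i \left[ \gamma_{i\,*} =
\begin{cases}
\alpha_{i\,*} & \text{if } \alpha_{i\,i} \ge \beta_{i\,i} \\
\beta_{i\,*} & \text{otherwise}
\end{cases}
\right] .$$

Likewise, the intersection of two system states in $\mathcal{S}$ can
be formed such that each process has received only those messages that
it has in both of the two original system states. This can be formed
from the dependency matrices describing these states by choosing for
each process the row that has the smallest state interval index of the
corresponding rows in the original matrices.

Definition 3 If $A = [\alpha_{*\,*}]$ and $B = [\beta_{*\,*}]$ are
system states in $\mathcal{S}$, then the intersection of A and B is
$A \cap B = [\delta_{*\,*}]$, such that

$$\forall\, i \left[ \delta_{i\,*} =
\begin{cases}
\alpha_{i\,*} & \text{if } \alpha_{i\,i} \le \beta_{i\,i} \\
\beta_{i\,*} & \text{otherwise}
\end{cases}
\right] .$$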
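
Definitions 2 and 3 amount to a row-by-row selection between the two matrices. The following Python sketch illustrates this, assuming matrices are lists of rows and $\bot$ is represented by -1; the function names and the example states are hypothetical.

```python
def union(a, b):
    """For each process, keep the row with the larger state interval index."""
    return [list(a[i]) if a[i][i] >= b[i][i] else list(b[i])
            for i in range(len(a))]


def intersection(a, b):
    """For each process, keep the row with the smaller state interval index."""
    return [list(a[i]) if a[i][i] <= b[i][i] else list(b[i])
            for i in range(len(a))]


# Hypothetical states of a two-process system. In state_a, process 0 has
# received a message sent from interval 0 of process 1; in state_b,
# process 1 has received a message sent from interval 0 of process 0.
state_a = [[1, 0], [-1, 0]]
state_b = [[0, -1], [0, 1]]

upper = union(state_a, state_b)
lower = intersection(state_a, state_b)
```

In this example `upper` takes row 0 from `state_a` and row 1 from `state_b`, and `lower` does the opposite.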


Continuing the example of Section 2.2 in Figure 1, the union and
intersection of states A and B can be formed by choosing the proper
rows from these two matrices to get

The following theorem introduces the system history lattice formed by
the set of system states that have occurred during any single
execution of some system, ordered by the system history relation.

Theorem 1 The set $\mathcal{S}$, ordered by the system history
relation, forms a lattice. For any $A, B \in \mathcal{S}$, the least
upper bound of A and B is $A \cup B$, and the greatest lower bound of
A and B is $A \cap B$.

Proof Straightforward from the construction of system state union and
intersection in Definitions 2 and 3. □


2.4  Consistent  System  States 

A system state is consistent if and only if all messages received by
all processes have either already been sent in the state of the
sending process or can deterministically be sent by that process in
the future. Since process execution within a state interval is
deterministic, any message sent before the end of the current state
interval can deterministically be sent, but messages sent after this
cannot be. Only a consistent system state could be reached through
normal execution of the system from its initial state, if an
instantaneous snapshot of the entire system could be observed [2].

Any messages shown by the system state to be sent but not yet received
do not cause the system state to be inconsistent. These messages can
be handled by the normal mechanism for reliable message delivery, if
any, used by the underlying system. In particular, suppose such a
message m was received by some process i after the state of process i
was observed to form the system state, and suppose process i then sent
some message n (such as an acknowledgement of message m), which could
show this receipt. If message n has been received in this system
state, the state will be inconsistent because message n (not message
m) is shown as having been received but not yet sent. If message n has
not been received yet, no effect of either message can be seen in the
system state, and it is thus still consistent.

If  a  system  state  is  consistent,  then  no  process 
depends  on  a  state  interval  of  the  sender  greater  than 
the  sender’s  current  state  interval  in  the  dependency 
matrix.  For  each  column  j  of  the  dependency  matrix, 
no  element  in  that  column  may  be  larger  than  the 
element  on  the  diagonal  of  the  matrix. 
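
This column-wise check can be sketched in a few lines of Python. The sketch assumes matrices are lists of rows, processes are numbered from 0, and $\bot$ is represented by -1; the function names and the example matrix are hypothetical.

```python
def is_consistent(d):
    """No entry in column j may exceed the diagonal entry of column j."""
    n = len(d)
    return all(d[i][j] <= d[j][j] for i in range(n) for j in range(n))


def violations(d):
    """List the (i, j) pairs where process i depends on a state interval
    of process j beyond process j's current state interval."""
    n = len(d)
    return [(i, j) for i in range(n) for j in range(n) if d[i][j] > d[j][j]]


# Hypothetical three-process state: process 0 depends on interval 1 of
# process 1, but process 1 is still in interval 0.
observed = [[1, 1, -1],
            [-1, 0, -1],
            [-1, -1, 0]]
```

For `observed`, the only violation is the pair (0, 1), so the state is not consistent.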

Definition 4 If $D = [\delta_{*\,*}]$ is some system state in
$\mathcal{S}$, D is consistent if and only if

$$\forall\, i, j \; [\delta_{i\,j} \le \delta_{j\,j}] .$$

Let the set $\mathcal{C} \subseteq \mathcal{S}$ be the set of
consistent system states that have occurred during any single
execution of some system. Thus,

$$\mathcal{C} = \{\, D \in \mathcal{S} \mid D \text{ is consistent} \,\} .$$

For example, consider the system of three processes whose execution is
shown in Figure 2. The state of each process here is observed where
the curve crosses the execution line for that process, and the
resulting system state is represented by the dependency matrix

$D = [\delta_{*\,*}] =$

This system state is not consistent since process 1 has received a
message (to begin state interval 1) from process 2 that has not been
sent yet by process 2 and cannot be deterministically sent in the
future. This inconsistency is shown in the dependency matrix since
$\delta_{1\,2}$ is greater than $\delta_{2\,2}$.

Lemma 1 The set $\mathcal{C}$ forms a sublattice of the system history
lattice.

Proof It suffices to show that for any $A, B \in \mathcal{C}$,
$A \cup B \in \mathcal{C}$ and $A \cap B \in \mathcal{C}$. Let
$A = [\alpha_{*\,*}]$ and $B = [\beta_{*\,*}]$.

($A \cup B \in \mathcal{C}$): Let $C = [\gamma_{*\,*}] = A \cup B$. In
each column j of C, either $\gamma_{i\,j} \le \alpha_{j\,j}$ or
$\gamma_{i\,j} \le \beta_{j\,j}$ for all i, since $A \in \mathcal{C}$
and $B \in \mathcal{C}$. Since
$\gamma_{j\,j} = \max(\alpha_{j\,j}, \beta_{j\,j})$,
$\gamma_{i\,j} \le \gamma_{j\,j}$ for all i as well. Therefore,
$A \cup B \in \mathcal{C}$.

($A \cap B \in \mathcal{C}$): Let $D = [\delta_{*\,*}] = A \cap B$. By
Definition 3 and since no element in the dependency vector for any
process ever decreases as the process executes,
$\delta_{i\,j} = \min(\alpha_{i\,j}, \beta_{i\,j})$, for all i and j.
This implies that $\delta_{i\,j} \le \alpha_{i\,j}$ and
$\delta_{i\,j} \le \beta_{i\,j}$. Since A and B are consistent,
$\alpha_{i\,j} \le \alpha_{j\,j}$ and $\beta_{i\,j} \le \beta_{j\,j}$.
Combining this with the previous result yields
$\delta_{i\,j} \le \alpha_{j\,j}$ and $\delta_{i\,j} \le \beta_{j\,j}$.
This implies that $\delta_{i\,j} \le \min(\alpha_{j\,j}, \beta_{j\,j})$,
and thus $\delta_{i\,j} \le \delta_{j\,j}$, for all i and j. Therefore,
$A \cap B \in \mathcal{C}$. □

2.5 Message Logging and Checkpointing

A message is called logged if and only if its data and the index of
the state interval that it started in its receiver process are both
recorded on stable storage. The predicate $\mathit{logged}(i, \sigma)$
is true if and only if the message that started state interval
$\sigma$ in process i is logged.

The predicate $\mathit{checkpoint}(i, \sigma)$ is true if and only if
there exists a checkpoint on stable storage that records the state of
process i in state interval $\sigma$. When a process is created, it is
immediately checkpointed before it begins execution, and thus,
$\mathit{checkpoint}(i, 0)$ is true for all processes i.

For every state interval $\sigma$ of some process, there must be some
checkpoint on stable storage for that process with a state interval
index no larger than $\sigma$, since there is always at least a
checkpoint on stable storage for state interval 0.

Definition 5 The effective checkpoint for a state interval $\sigma$ of
some process i is the checkpoint on stable storage for process i with
the largest state interval index $\epsilon$ such that
$\epsilon \le \sigma$.

A state interval of a process is called stable if and only if all
messages received by the process to start state intervals after its
effective checkpoint are logged. The predicate
$\mathit{stable}(i, \sigma)$ is true if and only if state interval
$\sigma$ of process i is stable.

Definition 6 If $\sigma$ is the state interval index for some process
i, and if $\epsilon$ is the state interval index of the effective
checkpoint for state interval $\sigma$ of process i, then state
interval $\sigma$ of process i is stable if and only if

$$\forall\, \alpha,\; \epsilon < \alpha \le \sigma \; [\, \mathit{logged}(i, \alpha) \,] .$$

Any stable state interval $\sigma$ for a process can be recreated by
restoring the process from the effective checkpoint (with state
interval index $\epsilon$) and replaying to it in order any logged
messages to begin state intervals $\epsilon + 1$ through $\sigma$.
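
Definitions 5 and 6 and the replay rule can be sketched for a single process as follows. In this Python sketch, the sets `checkpoints` and `logged` are hypothetical in-memory views of what is on stable storage for one process, and the function names are illustrative; `checkpoints` is assumed always to contain 0.

```python
def effective_checkpoint(checkpoints, sigma):
    """Largest checkpointed state interval index that is <= sigma."""
    return max(e for e in checkpoints if e <= sigma)


def stable(checkpoints, logged, sigma):
    """True if every message starting intervals e+1..sigma is logged."""
    e = effective_checkpoint(checkpoints, sigma)
    return all(a in logged for a in range(e + 1, sigma + 1))


def replay_plan(checkpoints, logged, sigma):
    """Checkpoint to restore and message indices to replay, or None."""
    if not stable(checkpoints, logged, sigma):
        return None
    e = effective_checkpoint(checkpoints, sigma)
    return e, list(range(e + 1, sigma + 1))


checkpoints = {0, 3}  # checkpoints exist for state intervals 0 and 3
logged = {1, 2, 5}    # the messages starting intervals 1, 2 and 5 are logged
```

With these sets, state interval 2 is stable (restore checkpoint 0, replay messages 1 and 2), state interval 3 is stable because it is itself checkpointed, and state intervals 4 and 5 are not stable because the message starting interval 4 is not logged.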

The checkpoint of a process includes the complete current dependency
vector for the process. Each logged message, though, only gives the
single dependency created in the receiver by this message. The
complete dependency vector for any stable state interval of some
process is always known, though, since all messages that started state
intervals since the effective checkpoint must be logged.

2.6  Recoverable  System  States 

A  system  state  is  called  recoverable  if  and  only  if  all 
component  process  states  are  stable  and  the  resulting 
system  state  is  consistent.  To  recover  the  state  of  the 
system,  it  must  be  possible  to  recover  the  states  of  the 
component  processes,  and  for  this  system  state  to  be 
meaningful,  it  must  be  possible  to  have  reached  this 
state  through  normal  execution  of  the  system  from 
its  initial  state. 

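Definition 7 If $D = [\delta_{*\,*}]$ is some system state in
$\mathcal{S}$, D is recoverable if and only if

$$D \in \mathcal{C} \;\wedge\; \forall\, i \; [\, \mathit{stable}(i, \delta_{i\,i}) \,] .$$

Let the set $\mathcal{R} \subseteq \mathcal{S}$ be the set of
recoverable system states that have occurred during any single
execution of some system. Thus,

$$\mathcal{R} = \{\, D \in \mathcal{S} \mid D \text{ is recoverable} \,\} .$$

Since only consistent system states can be recoverable,
$\mathcal{R} \subseteq \mathcal{C} \subseteq \mathcal{S}$.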
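
The recoverability test combines the consistency check with the stability predicate. A minimal Python sketch, in which the `stable` callable, the set of stable intervals, and the example matrices are all hypothetical (processes are numbered from 0 and $\bot$ is represented by -1):

```python
def is_consistent(d):
    """No entry in column j may exceed the diagonal entry of column j."""
    n = len(d)
    return all(d[i][j] <= d[j][j] for i in range(n) for j in range(n))


def is_recoverable(d, stable):
    """Consistent, and every process's current state interval is stable.

    `stable(i, sigma)` is supplied by the caller and reports whether
    state interval sigma of process i is stable.
    """
    return is_consistent(d) and all(stable(i, d[i][i]) for i in range(len(d)))


# Hypothetical stability information for a two-process system.
stable_intervals = {(0, 0), (0, 1), (1, 0)}


def stable(i, sigma):
    return (i, sigma) in stable_intervals
```

A state can fail this test either because it is inconsistent or because some component state interval is not yet stable.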

Lemma 2 The set $\mathcal{R}$ forms a sublattice of the system history
lattice.

Proof For any $A, B \in \mathcal{R}$, $A \cup B \in \mathcal{C}$ and
$A \cap B \in \mathcal{C}$, by Lemma 1. Since the state of each
process in A and B is stable, all process states in $A \cup B$ and
$A \cap B$ are stable as well. Thus, $A \cup B \in \mathcal{R}$ and
$A \cap B \in \mathcal{R}$, and $\mathcal{R}$ forms a sublattice. □

2.7  The  Current  Recovery  State 

In recovering after a failure, we wish to restore the state of the
system to the "most recent" recoverable state that is possible from
the information available, in order to minimize the amount of
reexecution necessary to complete the recovery. The system history
lattice corresponds to this notion of time, and the following theorem
establishes the existence of a single maximum recoverable state under
this ordering.

Theorem 2 There is always a unique maximum recoverable system state in
$\mathcal{S}$.

Proof $\mathcal{R} \subseteq \mathcal{S}$, and by Lemma 2,
$A \cup B \in \mathcal{R}$ for any $A, B \in \mathcal{R}$. Since
$A \preceq A \cup B$ and $B \preceq A \cup B$, the unique maximum in
$\mathcal{S}$ is simply

$$\bigcup_{D \in \mathcal{R}} D ,$$

which must be unique since $\mathcal{R}$ forms a sublattice of the
system history lattice. □

The  maximum  recoverable  system  state  at  any 
time  is  called  its  current  recovery  state.  The  following 
lemma  shows  that  the  current  recovery  state  of  the 
system  never  decreases. 

Lemma 3 If the current recovery state of the system is
$R = [\rho_{*\,*}]$, then, for each process i, the system can always
be recovered without needing to roll back any state interval
$\sigma \le \rho_{i\,i}$.

Proof R will always remain consistent, and for each process i, state
interval $\rho_{i\,i}$ will always remain stable. Since $\mathcal{R}$
forms a sublattice, any new current recovery state established after R
must be greater than R in the lattice. By Definition 1, this implies
that the state interval index for each process in any new current
recovery state must be greater than or equal to $\rho_{i\,i}$.
Therefore, for each process i, no state interval
$\sigma \le \rho_{i\,i}$ will ever need to be rolled back. □

Corollary  1  If  all  messages  received  by 
executing  processes  are  eventually  logged, 
there  is  no  possibility  of  the  domino  effect 
in  the  system. 

Proof  If  all  messages  are  eventually  logged,  all  state 
intervals  of  all  processes  eventually  become  stable  by 
Definition  6,  and  thus  new  recoverable  states  must 
become  possible  through  Definition  7.  By  Lemma  3, 
these  states  will  never  need  to  be  rolled  back.  □ 


2.8  Committing  Output 

If some state interval of a process must be rolled back to recover a
consistent system state, any output messages sent while that state
interval is being reexecuted after recovery may not be the same as
those originally sent. Any processes that received such messages will
be orphans and must also be rolled back to a point before these
messages were received.

However, messages sent to the outside world, such as those to the
user's display terminal, cannot be treated in the same way. Since the
outside world generally cannot be rolled back, any messages sent to
the outside world must be delayed until it is known that the state
interval from which they were sent will never need to be rolled back,
at which time they may be committed by releasing them. The following
corollary establishes when it is safe to commit an output message sent
to the outside world.

Corollary 2 If the current recovery state of the system is
$R = [\rho_{*\,*}]$, then any message sent by a process i from a state
interval $\sigma \le \rho_{i\,i}$ may be committed.

Proof  Follows  directly  from  Lemma  3.  □ 
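
The commit rule of Corollary 2 can be applied to a buffer of delayed output messages as in the Python sketch below. The `pending` list of `(process, state_interval, payload)` triples, the function name, and the example values are assumptions made for illustration; processes are numbered from 0 and $\bot$ is represented by -1.

```python
def split_committable(recovery_state, pending):
    """Separate buffered outputs that may be released from those still held.

    An output sent from state interval sigma of process i may be committed
    once sigma <= recovery_state[i][i].
    """
    ready, held = [], []
    for process, interval, payload in pending:
        if interval <= recovery_state[process][process]:
            ready.append(payload)
        else:
            held.append((process, interval, payload))
    return ready, held


# Hypothetical current recovery state and buffered outputs.
recovery_state = [[2, -1], [1, 1]]
pending = [(0, 1, "a"), (0, 3, "b"), (1, 1, "c"), (1, 2, "d")]
ready, held = split_committable(recovery_state, pending)
```

Outputs "a" and "c" can be released here, while "b" and "d" stay buffered until the current recovery state advances past the state intervals that sent them.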


2.9  Garbage  Collection 

While the system is operating, checkpoints and logged messages
accumulate on stable storage in case they are needed for some future
recovery. This data may be removed from stable storage whenever doing
so will not interfere with the ability of the system to recover as
needed. The following two corollaries establish when this can safely
be done.

Corollary 3 Let $R = [\rho_{*\,*}]$ be the current recovery state. For
each process i, if $\epsilon_i$ is the state interval index of the
effective checkpoint for its state interval $\rho_{i\,i}$, then any
checkpoint for process i with state interval index
$\sigma < \epsilon_i$ may be released from stable storage.

Proof  Follows  directly  from  Lemma  3.  □ 

Corollary 4 Let $R = [\rho_{*\,*}]$ be the current recovery state. For
each process i, if $\epsilon_i$ is the state interval index of the
effective checkpoint for its state interval $\rho_{i\,i}$, then any
message that begins a state interval in process i with index
$\sigma \le \epsilon_i$ may be released from stable storage.

Proof  Follows  directly  from  Lemma  3.  □ 
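
Corollaries 3 and 4 can be applied together as a garbage collection pass, sketched below in Python. The per-process sets `checkpoints[i]` and `logged[i]` are hypothetical in-memory views of stable storage, the function name is illustrative, and each `checkpoints[i]` is assumed to contain 0.

```python
def collect_garbage(recovery_state, checkpoints, logged):
    """Return the checkpoints and logged messages that must be retained.

    For each process i, with e the effective checkpoint of its state
    interval in the current recovery state, checkpoints with index < e
    and messages starting intervals with index <= e may be released.
    """
    kept_checkpoints, kept_logged = [], []
    for i, row in enumerate(recovery_state):
        rho = row[i]
        e = max(c for c in checkpoints[i] if c <= rho)
        kept_checkpoints.append({c for c in checkpoints[i] if c >= e})
        kept_logged.append({m for m in logged[i] if m > e})
    return kept_checkpoints, kept_logged


# Hypothetical two-process system: process 0 is at state interval 4 in the
# current recovery state, process 1 is still at state interval 0.
recovery_state = [[4, -1], [-1, 0]]
checkpoints = [{0, 2, 3, 6}, {0}]
logged = [{1, 2, 3, 4, 5}, set()]
kept_checkpoints, kept_logged = collect_garbage(recovery_state, checkpoints, logged)
```

For process 0 the effective checkpoint is at state interval 3, so checkpoints 0 and 2 and the messages starting intervals 1 through 3 can be released.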


3  Recovery  State  Algorithm 

3.1  Introduction 

As the system executes, new logged messages and checkpoints arrive on
stable storage. Occasionally, some combination of this information may
define a new current recovery state by creating a new recoverable
system state greater than the existing current recovery state. Theorem
2 of Section 2 established that there is always a unique maximum
recoverable state at any time. Conceptually, this state may be found
by an exhaustive search of all combinations of stable process state
intervals until the maximum combination is found. However, such a
search would be too expensive in practice, and an effective means of
limiting this search space is important.

The recovery state algorithm monitors checkpoints and logged messages
as they arrive on stable storage and decides if each allows an advance
in the current recovery state of the system. The algorithm is invoked
each time a process state interval becomes stable; it is incremental
in that it only examines information that has changed since its last
execution, rather than recomputing the entire current recovery state
on each execution. Since it only uses information on stable storage,
it can handle any number of concurrent process failures.

3.2  The  Basic  Algorithm 

Each time some new state interval $\sigma$ of some process k becomes
stable, the algorithm attempts to form a new current recovery state in
which the state of process k is advanced to state interval $\sigma$.
It does so by including any state intervals from other processes that
are necessary to make this new system state consistent. The check for
consistency is performed by a direct application of the definition of
system state consistency from Section 2. The algorithm succeeds if all
such state intervals included are stable, making this new consistent
system state composed entirely of stable process state intervals.
Otherwise, no new current recovery state is possible.

An outline of the basic recovery state algorithm is shown below. Some
details are omitted from this outline for clarity; these will be
discussed later after the basic algorithm is described. Let
$R = [\rho_{*\,*}]$ be the current recovery state of the system. When
state interval $\sigma$ of process k becomes stable, the following
steps are taken by the algorithm:

1. If $\sigma < \rho_{k\,k}$, then exit the algorithm, since the
   current recovery state is already in advance of state interval
   $\sigma$ of process k.

2. Make a new dependency matrix $D = [\delta_{*\,*}]$ from R, with row
   k replaced by the dependency vector for state interval $\sigma$ of
   process k.

3. Loop on step 3 while D is not consistent. That is, loop while there
   exists some i and j for which $\delta_{i\,j} > \delta_{j\,j}$,
   which shows that some process i depends on a state interval of
   process j greater than process j's current state interval in D.

   Find a stable state interval $\alpha \ge \delta_{i\,j}$ of process
   j. If state interval $\delta_{i\,j}$ is stable, let $\alpha$ be
   $\delta_{i\,j}$; otherwise, choose some later state interval of
   process j for $\alpha$, if one exists:

   (a) If no such state interval exists for $\alpha$ that is stable,
       exit the algorithm, but remember to recheck this later.

   (b) Otherwise, replace row j of D with the dependency vector for
       state interval $\alpha$ of process j.

4. The dependency matrix D is now consistent and composed only of
   stable process state intervals. It is thus recoverable. Replace R
   with this new system state D, making it the new current recovery
   state.
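
The four steps above can be sketched in Python as follows. This is only an illustration of the outline, not a complete implementation: it omits the details the outline itself omits, chooses the minimum stable state interval in step 3, and relies on hypothetical inputs. `stable_intervals[j]` is assumed to be the set of stable state interval indices of process j, and `vectors[(j, alpha)]` the dependency vector of state interval `alpha` of process j; processes are numbered from 0 and $\bot$ is represented by -1.

```python
def min_stable_at_least(stable_intervals, j, bound):
    """Smallest stable state interval of process j that is >= bound."""
    candidates = [a for a in stable_intervals[j] if a >= bound]
    return min(candidates) if candidates else None


def advance_recovery_state(R, k, sigma, stable_intervals, vectors):
    """Try to advance process k to state interval sigma.

    Returns the new recovery state, or None if no advance is possible now.
    """
    if sigma < R[k][k]:                      # step 1
        return None
    D = [list(row) for row in R]             # step 2
    D[k] = list(vectors[(k, sigma)])
    n = len(D)
    while True:                              # step 3
        bad = next(((i, j) for i in range(n) for j in range(n)
                    if D[i][j] > D[j][j]), None)
        if bad is None:
            break                            # D is consistent
        i, j = bad
        alpha = min_stable_at_least(stable_intervals, j, D[i][j])
        if alpha is None:
            return None                      # step 3a: caller rechecks later
        D[j] = list(vectors[(j, alpha)])     # step 3b
    return D                                 # step 4


# Hypothetical two-process system: state interval 1 of process 0 depends
# on state interval 1 of process 1.
R = [[0, -1], [-1, 0]]
vectors = {(0, 0): [0, -1], (0, 1): [1, 1],
           (1, 0): [-1, 0], (1, 1): [-1, 1]}
blocked = advance_recovery_state(R, 0, 1, [{0, 1}, {0}], vectors)
advanced = advance_recovery_state(R, 0, 1, [{0, 1}, {0, 1}], vectors)
```

In the first call, state interval 1 of process 1 is not yet stable, so no new recovery state can be formed; in the second it is stable and both processes advance. The loop terminates because each row replacement strictly increases the diagonal entry of that row and each process has finitely many stable state intervals.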

3.3  Some  Details 

Lemma 4 The state interval $\alpha$ chosen for process j during each
iteration in step 3 must be the minimum $\alpha \ge \delta_{i\,j}$
that is stable.

Proof As a process executes, no element of its dependency vector can
decrease. Thus, the dependencies of any state interval of process j
after this minimum $\alpha$ will be at least as large as those of
state interval $\alpha$. Clearly, if state interval $\delta_{i\,j}$ is
stable, its dependencies will be exactly those that are necessary; any
later state interval of process j may have additional dependencies
that state interval $\delta_{i\,j}$ does not have. Using the minimum
set of dependencies possible with the stable process states that are
available will restrict the solutions the least. □
Lemma  5  The  comparisons  in  step  3  to 
check  if  D  is  a  consistent  system  state  may 
be  made  in  any  order  without  affecting  the 
final  resulting  dependency  matrix. 

Proof  Since  the  only  change  made  to  D  during  the 
loop  of  step  3  is  the  replacement  of  row  j  with  the 
dependency  vector  for  state  interval  a,  the  only  effect 
that  the  order  of  these  comparisons  has  is  the  order 
in  which  these  row  replacements  are  performed. 

First,  each  replacement  of  row  j  can  only  increase 
Sj  j ,  since  row  j  is  only  replaced  when  Sij  >  Sjj,  and 
the  new  dependency  vector  for  that  row  is  always 
chosen  such  that  its  state  interval  a  >  Sij  >  Sjj. 


Second, any row replacements required by the replacement of some row j will still be required after the replacement of row j', j' ≠ j, unless row j also required the replacement of row j', and the state interval index of the new row j' is greater than that required by row j, in which case this new row j' would still be required if row j's requirement had been met first.

Thus, the dependency vector left in each row of D when the algorithm terminates will always have the maximum state interval index of any vector placed in that row during the loop of step 3, regardless of the order in which the row replacements are made. □
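
The comparisons that Lemma 5 refers to test whether some δ_ij exceeds δ_jj. A minimal Python sketch of that test follows. It assumes the dependency matrix is a list of rows, that ⊥ is represented by `None`, and that every diagonal entry is an integer; the names `inconsistent_pairs` and `is_consistent` are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple

Matrix = List[List[Optional[int]]]


def inconsistent_pairs(D: Matrix) -> List[Tuple[int, int]]:
    """List every (i, j) with D[i][j] > D[j][j], i.e. each row j needing replacement."""
    pairs = []
    for i, row in enumerate(D):
        for j, dep in enumerate(row):
            if dep is None:
                continue  # process i has no dependency on process j
            if dep > D[j][j]:
                pairs.append((i, j))
    return pairs


def is_consistent(D: Matrix) -> bool:
    """A matrix is treated as consistent when no row demands a later interval."""
    return not inconsistent_pairs(D)
```

The order in which the pairs are listed here is arbitrary, which is harmless by Lemma 5.
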

Lemma 6 When state interval σ of process k becomes stable, the basic algorithm finds some recoverable system state D = [δ_**] with δ_kk = σ, if any such system state exists.

Proof Any system state found must be recoverable since only stable process state intervals are included by the algorithm, and the resulting system state is checked for consistency. As each row of D is replaced, the dependencies that must be satisfied grow as little as possible with the stable process states that are available, as shown in Lemma 4. Since the state interval for any process used in D never decreases as the algorithm executes, the state interval for process k in any recoverable state found will never be less than σ. If the algorithm finds that it needs any state interval of process k greater than σ, no new recoverable state is possible, since the fact that state interval σ of process k is now stable has no effect on such a system state, and any such recoverable state that exists would have already been found by some earlier execution of the algorithm. □

Lemma 7 No stable process state interval that was deferred in step 3a needs to be rechecked until step 4 advances the current recovery state.

Proof Suppose some state interval σ of process k becomes stable and the algorithm determines that no new recoverable state is possible. By Lemma 6, this means that no consistent set of stable process state intervals A = [α_**] is available with α_kk = σ.

Now suppose some new state interval σ' of process k' becomes stable. If the algorithm determines that no new recoverable state is possible, there is no consistent set of stable process state intervals B = [β_**] available with β_k'k' = σ'. The only effect that this new state interval σ' of process k' can have on the earlier evaluation of state interval σ of process k is that some recoverable state may now be possible with the state interval index of process k' set to σ', but the algorithm has already determined that no such recoverable state is possible. Thus, there is no need to recheck any earlier deferred stable process states in this case. □

This lemma shows when it is necessary to recheck any deferred stable state intervals. It also gives a method to greatly limit the set of those deferred stable state intervals that need to be rechecked, rather than rechecking all such state intervals that are not yet included in the current recovery state.

Corollary 5 When the current recovery state advances from R = [ρ_**] to some new state R' = [ρ'_**], the stable process states that were deferred earlier by step 3a and should now be rechecked are those with a direct dependency on some state interval α of any process i such that ρ_ii < α ≤ ρ'_ii.

Proof  The  proof  follows  directly  from  the  proof  of 
Lemma  7.  □ 

3.4  Correctness 

Theorem  3  The  recovery  state  algorithm 
always  finds  the  current  recovery  state  of  the 
system. 

Proof First, by Lemma 6, the algorithm only finds recoverable system states. Also, any such system states found will be greater than the previous current recovery state since at least the new state interval σ for process k is always greater than the previous state interval index for process k in the current recovery state. Lemma 6 also shows that if some new recoverable state can be formed when state interval σ of process k becomes stable, the algorithm finds one. Lemma 7 shows when it is necessary to recheck any process state interval that could not be added to a new current recoverable state when it became stable, and Corollary 5 shows which state intervals should be rechecked then. By rechecking all those state intervals at the correct times, the maximum recoverable state must be found. □

3.5  An  Efficient  Procedure 

The algorithm described in Sections 3.2 and 3.3 can be implemented efficiently by making some observations about the execution of the algorithm, based on Lemma 5.

When step 3 examines the dependency matrix D on each iteration, there may be many pairs of i and j for which δ_ij > δ_jj, indicating that several different rows of the matrix need to be replaced. These required row replacements can be entered into a list of pending replacements as each is discovered. Since initially only row k of D has been changed from the current recovery state, and since on each iteration only one row is replaced at a time, only the single changed row needs to be compared against the diagonal elements of D for consistency. Thus, only the diagonal elements of the matrix are needed during the execution of the algorithm. The list of pending row replacements only needs to remember the maximum index of any state interval needed for each process, since the dependency vector that the algorithm leaves in each row of D is the one for that process with the maximum state interval index, regardless of the order in which the replacements are performed.

The function FIND_RV, shown in Figure 3, is a procedure to implement steps 1 through 3 of the basic algorithm, taking advantage of these observations. This procedure finds a new recoverable state based on state interval σ of process k, if such a state exists. The list of pending row replacements is maintained in NEED, such that NEED[i] is always the maximum index of any state interval in process i that is currently needed to replace row i of the matrix. If no row replacements are currently needed for some process j, then NEED[j] is set to ⊥. A vector, RV, is used instead of the full dependency matrix, where RV[i] is diagonal element i of the corresponding dependency matrix, which is also the state interval index of process i in the recoverable state. As each row is replaced, only the corresponding single element of RV is changed.


function FIND_RV(RV, k, σ)
    if σ ≤ RV[k] then return true;
    for i ← 1 to n do NEED[i] ← ⊥;
    RV[k] ← σ;
    for i ← 1 to n do
        if DV_k^σ[i] > RV[i] then NEED[i] ← DV_k^σ[i];
    while ∃ i such that NEED[i] ≠ ⊥ do
        α ← minimum such that
            α ≥ NEED[i] and stable(i, α);
        if no such α then return false;
        RV[i] ← α;
        NEED[i] ← ⊥;
        for j ← 1 to n do
            if DV_i^α[j] > RV[j] then
                NEED[j] ← max(NEED[j], DV_i^α[j]);
    return true;


Figure  3  Finding  a  new  recoverable  state 
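
The procedure of Figure 3 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. It assumes that processes are numbered from 0, that ⊥ is represented by `None`, that `dv[(i, a)]` holds the dependency vector of state interval a of process i, and that `stable[i]` is an ascending list of the stable state interval indices of process i. The name `find_rv` and both data layouts are assumptions made for this example.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

DepVectors = Dict[Tuple[int, int], List[Optional[int]]]


def find_rv(rv: List[int], k: int, sigma: int,
            dv: DepVectors, stable: Dict[int, List[int]]) -> bool:
    """Try to extend rv, in place, to a recoverable state that includes (k, sigma)."""
    if sigma <= rv[k]:
        return True  # sigma is already covered by rv
    n = len(rv)
    need: List[Optional[int]] = [None] * n  # pending row replacements
    rv[k] = sigma
    for i, dep in enumerate(dv[(k, sigma)]):
        if dep is not None and dep > rv[i]:
            need[i] = dep
    while True:
        pending = [i for i in range(n) if need[i] is not None]
        if not pending:
            return True
        i = pending[0]  # any pending process may be taken first
        candidates = stable[i]
        pos = bisect_left(candidates, need[i])
        if pos == len(candidates):
            return False  # no stable interval at or above the needed index
        alpha = candidates[pos]
        rv[i] = alpha
        need[i] = None
        for j, dep in enumerate(dv[(i, alpha)]):
            if dep is not None and dep > rv[j]:
                need[j] = dep if need[j] is None else max(need[j], dep)
```

On failure the vector `rv` is left partially updated, so a caller that wants to keep its previous state should pass a copy.
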


Using function FIND_RV, the full recovery state algorithm can now be stated. This algorithm, shown in Figure 4, initially calls FIND_RV on the state interval that just became stable. If no new recoverable state is found, the algorithm exits since no change in the current recovery state is possible from this new stable state. If FIND_RV returns success, the result becomes the new current recovery state, and the algorithm checks if any other recoverable states greater than this result can now exist. The sets DEFER_j^β keep track of those deferred stable process state intervals that should be rechecked when the current recovery state advances over state interval β of process j. The set WORK keeps a list of those deferred states that are to be rechecked by the algorithm because the current recovery state has been advanced.

4  Related  Work 

A number of fault-tolerance recovery methods based on message logging and checkpointing have been published in the literature. These include ones using pessimistic logging protocols such as Auros [1], Publishing [6], and sender-based message logging [3], as well as optimistic methods [9]. The model and recovery state algorithm presented in Sections 2 and 3 can be applied to each of these and used to reason about their correctness.


WORK ← {(k, σ)};
while WORK ≠ ∅ do
    remove some state (x, θ) from WORK;
    if θ > CRS[x] then
        for j ← 1 to n do
            NEWCRS[j] ← CRS[j];
        if FIND_RV(NEWCRS, x, θ) = true
        then
            for j ← 1 to n do
                for β ← CRS[j] + 1 to NEWCRS[j] do
                    WORK ← WORK ∪ DEFER_j^β;
                    DEFER_j^β ← ∅;
                CRS[j] ← NEWCRS[j];
        else
            for j ← 1 to n do
                β ← DV_x^θ[j];
                if β > CRS[j] then
                    DEFER_j^β ← DEFER_j^β ∪ {(x, θ)};


Figure  4  The  recovery  state  algorithm 
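
The loop of Figure 4 can be sketched in Python under a few stated assumptions: processes are numbered from 0, ⊥ is represented by `None`, the DEFER sets are stored in a dictionary keyed by (process, state interval index), and the search of Figure 3 is supplied by the caller as a function `find_rv(rv, x, theta)` that updates `rv` in place and returns a boolean. The names `on_stable`, `defer`, and `State` are hypothetical and used only for this illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

State = Tuple[int, int]  # (process, state interval index)


def on_stable(crs: List[int], k: int, sigma: int,
              dv: Dict[State, List[Optional[int]]],
              defer: Dict[State, Set[State]],
              find_rv: Callable[[List[int], int, int], bool]) -> None:
    """Advance crs in place after state interval sigma of process k becomes stable."""
    work: Set[State] = {(k, sigma)}
    while work:
        x, theta = work.pop()
        if theta <= crs[x]:
            continue  # already part of the current recovery state
        new_crs = list(crs)  # work on a copy so a failed search changes nothing
        if find_rv(new_crs, x, theta):
            for j in range(len(crs)):
                # Recheck everything deferred on an interval just passed over.
                for beta in range(crs[j] + 1, new_crs[j] + 1):
                    work |= defer.pop((j, beta), set())
                crs[j] = new_crs[j]
        else:
            # Defer (x, theta) on each direct dependency not yet recoverable.
            for j, beta in enumerate(dv[(x, theta)]):
                if beta is not None and beta > crs[j]:
                    defer.setdefault((j, beta), set()).add((x, theta))
```

Passing the search function as a parameter keeps this sketch independent of how stable state intervals and dependency vectors are stored.
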


Our model is more general than is required by recovery methods based on pessimistic message logging, but the definitions of consistency, stability, and recoverability still apply, and the recovery state algorithm still computes the correct current recovery state. In this case, the current recovery state is identical to the state of the system at the time the failure occurred, since orphan processes are not possible. Since message logging is synchronous, however, a simpler recovery state algorithm is possible that takes advantage of the order in which information arrives on stable storage. In particular, checkpoints never add new information for the algorithm, since messages are always logged in ascending order by the index of the state interval that they start in their receivers, and all messages received before a checkpoint have already been logged before the checkpoint can be recorded.

Recovery based on optimistic logging protocols requires the full generality of our model, however. Since orphan processes are possible when using optimistic logging, recovery from a failure is more difficult. Any orphan processes must be rolled back during recovery to achieve a consistent state. Since there is no synchronization between message logging, checkpointing, and computation, information for the recovery state algorithm may arrive on stable storage at any time and in any order. Thus, the algorithm must be able to make use of all this information in order to advance the current recovery state to its maximum possible value at all times.

Our model and algorithm differ in several ways from those used for optimistic recovery by Strom and Yemini [9]. First, Strom and Yemini require reliable delivery of messages between processes. As a result, their definition of consistency differs from ours by requiring all messages sent to have been received. Our model does not require reliable delivery, but it can be incorporated easily by inserting a return acknowledgement message immediately following each message receipt, with our definition of consistency remaining unchanged. Second, although their system checkpoints processes in order to shorten recovery times and release old logged messages from stable storage, they do not take advantage of these checkpoints in computing the current maximum recoverable state in their system. Our algorithm uses both checkpoints and logged messages to compute the maximum recoverable state and thus may find recoverable states that their algorithm does not. Finally, our algorithm requires only the current state interval index of the sending process to be carried in each message, and requires only a vector of direct dependencies to be maintained by each process. In contrast, their method requires each process to maintain a vector of its transitive dependencies, and requires each message to be tagged with this vector, which has size linear in the number of processes. This added complexity does allow control of recovery in their system to be more decentralized than in ours.
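
To make the bookkeeping cost concrete, the following Python sketch shows one way a process might maintain a vector of direct dependencies when each message carries only the sender's current state interval index. It assumes that every received message starts a new state interval in the receiver, that ⊥ is represented by `None`, and that processes are numbered from 0. The class and field names (`Message`, `Process`, `sender_interval`) are hypothetical and not taken from any implementation described in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    sender: int
    sender_interval: int  # the only recovery metadata carried by the message
    payload: object = None


@dataclass
class Process:
    pid: int
    n: int  # number of processes in the system
    interval: int = 0
    dv: List[Optional[int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.dv = [None] * self.n
        self.dv[self.pid] = self.interval

    def send(self, payload: object = None) -> Message:
        return Message(self.pid, self.interval, payload)

    def receive(self, msg: Message) -> None:
        # Assumption: each received message begins a new state interval.
        self.interval += 1
        self.dv[self.pid] = self.interval
        prev = self.dv[msg.sender]
        # Entries never decrease; keep the largest index seen per sender.
        if prev is None or msg.sender_interval > prev:
            self.dv[msg.sender] = msg.sender_interval
```

Each message here carries a single integer of recovery metadata regardless of the number of processes, whereas tagging messages with a transitive dependency vector would add one entry per process.
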

5  Conclusion 

From a performance standpoint, optimistic message logging protocols appear to be desirable. They seem to constitute the right performance tradeoff in operating environments where failures are rare and failure-free performance is of primary concern. The recovery state algorithm of Section 3 represents an improvement on earlier work with recovery based on optimistic message logging by Strom and Yemini [9]. Although their algorithm eventually achieves a recoverable state, this state may not be optimal. Furthermore, their methods require reliable communication and seem more complex than the method presented here.

This work unifies existing approaches to fault tolerance based on message logging and checkpointing published in the literature, including those using pessimistic message logging methods [1, 6, 3] and those using optimistic methods [9]. By using this model to reason about these types of fault-tolerance recovery methods, properties that are independent of the message logging protocol used can be deduced and proven. We have shown that there is always a unique maximum recoverable system state, which never decreases, and that in a system where all messages received are eventually logged, the domino effect cannot occur. The use of this general model allows more attention to be paid instead to designing efficient message logging protocols.